Abstract — Multiple-input, multiple-output (MIMO) technology
provides high data rate and enhanced QoS for wireless com- 
munications. Since the benefits from MIMO result in a heavy 
computational load in detectors, the design of low-complexity 
sub-optimum receivers is currently an active area of research. 
Lattice-reduction-aided detection (LRAD) has been shown to be 
an effective low-complexity method with near-ML performance. 
In this paper we advocate the use of systolic array architectures 
for MIMO receivers, and in particular we exhibit one of them 
based on LRAD. The "LLL lattice reduction algorithm" and 
the ensuing linear detections or successive spatial-interference 
cancellations can be located in the same array, which is con- 
siderably hardware-efficient. Since the conventional form of the 
LLL algorithm is not immediately suitable for parallel processing, 
two modified LLL algorithms are considered here for the systolic 
array. LLL algorithm with full-size reduction (FSR-LLL) is one 
of the versions more suitable for parallel processing. Another 
variant is the all-swap lattice-reduction (ASLR) algorithm for 
complex-valued lattices, which processes all lattice basis vectors 
simultaneously within one iteration. Our novel systolic array can 
operate both algorithms with different external logic controls. 
In order to simplify the systolic array design, we replace the
Lovász condition in the definition of an LLL-reduced lattice with
the looser Siegel condition. Simulation results show that for LR- 
aided linear detections, the bit-error-rate performance is still 
maintained with this relaxation. We compare the two
algorithms in terms of bit-error-rate performance and average
FPGA processing time in the systolic array; the comparison
shows that ASLR is the better choice for a systolic architecture,
especially for systems with a large number of antennas.

Index Terms — Lattice reduction, MIMO receivers, systolic 
arrays, wireless communications. 

I. Introduction 

MULTIPLE-INPUT, multiple-output (MIMO) technology,
using several transmit and receive antennas in a
rich-scattering wireless channel, has been shown to provide
considerable improvement in spectral efficiency and channel
capacity. MIMO systems yield spatial diversity gain, spatial
multiplexing gain, array gain, and interference reduction
over single-input single-output (SISO) systems [2]. However,
these benefits come at the price of a computational complexity
of the detector that may be intolerably large. In fact, optimal
maximum-likelihood (ML) detection in large MIMO systems 
may not be feasible in real-time applications as its complexity 
increases exponentially with the number of antennas. Low- 
complexity receivers, employing linear detection or successive 
spatial-interference cancellation (SIC), are computationally 
less heavy, and amenable to simple hardware implementa- 
tion lO-llS]. However, diversity and error-rate performance of 
these low-complexity detectors are not comparable to those 
achieved with ML. 

Lattice-reduction-aided detection (LRAD), which combines 
lattice reduction techniques with linear detections or SIC, has 
been shown to yield some improvement on error-rate perfor- 
mance IS-lSl. Lenstra-Lenstra-Lovasz (LLL) algorithm |i9l 
is the most widely used lattice reduction algorithm, and can
be applied to complex-valued lattices [10]. The performance
of complex LLL-aided linear detection in MIMO systems
was analyzed in [11]. LLL-based LRAD was also shown to
achieve full receive diversity [12]. It was also shown that
LR-aided minimum mean-square-error decoding achieves the
optimal diversity-multiplexing tradeoff [13]. When applied to
MIMO detection, the average complexity of the LLL algorithm is
polynomial in the dimension of the channel matrix (the
worst-case complexity could be unbounded [13]). A fixed-complexity
LLL algorithm, which modifies the original version to allow 
more robust early termination, has recently been proposed 
in IITtII . In LRAD, LLL algorithm need be performed only 
when the channel state changes. If the channel change rate 
is high, or a large number of channel matrices need be pro- 
cessed such as in a MIMO-OFDM system, a fast-throughput 
algorithm and the corresponding implementation structure is 
needed for real-time applications. To obtain this, we first 
discuss two variants of LLL algorithm, suitably modified for 
parallel processing. Second, we propose a novel systolic array 
structure implementing the two modified LLL algorithms and 
the ensuing detection methods. 

A systolic array [18], [19] is a network of processing
elements (PEs) that transfer data locally and regularly with
nearby elements and work rhythmically. In Fig. 1(a), a simple
two-dimensional systolic array is shown as an example. In
this case, the matrix operation D = A · B + C is calculated
by the systolic array, where A, B, C, and D are 2 × 2
matrices. The operation of each PE is shown in Fig. 1(b).
The inputs of the systolic array, the entries of matrices A
and C, are pipelined in a slanted manner for proper timing.
Since all PEs can work simultaneously, the latency is shorter
than with a single-processor system, and the results of D
are output in parallel. Systolic algorithms and the
corresponding systolic arrays have been designed for a number of
linear algebra algorithms, such as matrix triangularization [20],
matrix inversion [21], adaptive nulling [22], and recursive
least squares [23], [24]. An overview of systolic designs for
several computationally demanding linear algebra algorithms
for signal processing and communications applications was
recently published in [25]. While systolic arrays allow simple
parallel processing and achieve higher data rates without 
the demand on faster hardware capabilities, the existence of 
multiple PEs implies a higher cost of circuit area. Thus, time 
efficiency is traded off with circuit area in hardware design. 
For the application we are advocating in this paper (MIMO 
detectors), systolic arrays offer an attractive solution, as we 
must cope with a high computational load while requiring 
high throughput and real-time operation. Systolic arrays have 
been previously suggested for MIMO applications. In [26], the
authors proposed a universal systolic array for adaptive and
conventional linear MIMO detectors. In [27], a reconfigurable
systolic array processor based on the coordinate rotation digital
computer (CORDIC) [28] is proposed to provide efficient
MIMO-OFDM baseband processing. Also, matrix factorization
and inversion are widely used in MIMO detection, with
systolic arrays used to increase the throughput [5], [29].
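
The D = A · B + C example described above can be mimicked in software. The following is a minimal cycle-by-cycle sketch, not the design of Fig. 1: it assumes that each PE holds one entry of B, that the rows of A enter from the left and the entries of C from the top with a one-cycle skew per row or column, and that the results leave at the bottom. The function name `systolic_matmul_add` and this data-flow arrangement are illustrative assumptions.

```python
def systolic_matmul_add(A, B, C):
    """Simulate a grid of PEs computing D = A*B + C (assumed data flow).

    PE (k, j) stores B[k][j]. Entry A[i][k] enters row k from the left at
    cycle i + k; entry C[i][j] enters column j from the top at cycle i + j.
    Partial sums move down one PE per cycle, A entries move right.
    """
    m, n, p = len(A), len(B), len(B[0])
    a_reg = [[0] * p for _ in range(n)]   # value each PE passes to the right
    s_reg = [[0] * p for _ in range(n)]   # partial sum each PE passes down
    D = [[0] * p for _ in range(m)]
    for t in range(m + n + p - 2):        # cycles until the last result leaves
        new_a = [[0] * p for _ in range(n)]
        new_s = [[0] * p for _ in range(n)]
        for k in range(n):
            for j in range(p):
                if j == 0:                # left edge: skewed external A input
                    i = t - k
                    a_in = A[i][k] if 0 <= i < m else 0
                else:
                    a_in = a_reg[k][j - 1]
                if k == 0:                # top edge: skewed external C input
                    i = t - j
                    s_in = C[i][j] if 0 <= i < m else 0
                else:
                    s_in = s_reg[k - 1][j]
                new_a[k][j] = a_in
                new_s[k][j] = s_in + a_in * B[k][j]   # multiply-accumulate
        a_reg, s_reg = new_a, new_s       # all PEs update simultaneously
        for j in range(p):                # collect results at the bottom row
            i = t - j - (n - 1)
            if 0 <= i < m:
                D[i][j] = s_reg[n - 1][j]
    return D
```

In this sketch every PE performs one multiply-accumulate per cycle, and the whole product is available after m + n + p - 2 cycles instead of the m·n·p sequential multiply-accumulates of a single processor.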

In this paper, our objective is to provide a novel systolic 
array design for LLL-based LRAD. The ideas are described 
from a system-level perspective instead of detailed discussion 
on the hardware-oriented issues. The system model and how 
LRAD works are briefly described in Section II. Since the
original LLL algorithm [8]-[15] is not designed for parallel
processing, and hence is not suitable for systolic design, two
modified LLL algorithms are considered here (Section III).
Note that we are not claiming that the two algorithms work
better than the original LLL in terms of the LRAD bit-error- 
rate (BER) performance. First, we improve on the format 
of conventional LLL algorithm by altering the flow of size- 
reduction process (we call it "LLL with full size-reduction," 
or FSR-LLL). FSR-LLL is more time-efficient in parallel 
processing than the conventional format, and hence suitable 
for systolic design. We also consider a variant of the LLL 
algorithm called "all-swap lattice reduction (ASLR)," which 
was first proposed in [30] for real lattices, and derive its
complex-number version. A crucial difference between ASLR 
and LLL algorithm is that with ASLR all lattice basis vectors 
are simultaneously processed during a single iteration. In both 
algorithms, in order to simplify the systolic array operations 
we replace the original Lovász condition [9] of the LLL algorithm
with the slightly weaker Siegel condition [31]. Surprisingly, for
LR-aided linear detection the BER performance with the Siegel
condition under a proper parameter setting is just as good as
that obtained with the Lovász condition. However, for LR-aided SIC,
the performance with the Lovász condition is still slightly better
due to less error propagation. The mapping from algorithm
to systolic array is introduced in Section IV. In our design,
ASLR and FSR-LLL can be operated on the same systolic-array
structure, but an external logic controller is also required
to control the algorithm flow. Additionally, since ASLR was
originally designed for parallel processing, a systolic array
running ASLR is on average more efficient than one
running FSR-LLL. Simulation results also show that ASLR-based
LRAD has a BER performance very similar to that of the
LLL algorithm. A comparison between our proposed design and
the conventional LLL in FPGA implementation shows that
the systolic arrays do provide faster processing speed with a
moderate increase in hardware resources. After the channel
state matrix has been lattice-reduced, linear detectors or SIC
can also be implemented by the same systolic array without
any extra hardware cost, which is discussed in Section V.

Fig. 1. (a) Two-dimensional systolic array performing the matrix calculation
D = A · B + C, where $a_{ij}$, $b_{ij}$, $c_{ij}$, $d_{ij}$ are the $(i, j)$ entries of the matrices
A, B, C, and D. (b) The operation of each processing element.

The following notation is used throughout the remaining
sections. Capital bold letters denote matrices, and lower-case
bold letters denote column vectors. For example,
$\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_m]$
is a matrix with m columns $\mathbf{x}_1$ to $\mathbf{x}_m$. The
entry of a matrix X at position $(i, j)$ is denoted by $x_{i,j}$, and
the $k$th element of a vector x is denoted by $x_k$. The submatrix
(subvector) formed from the $a$th to $b$th rows and $m$th to $n$th
columns of X is denoted by $\mathbf{X}_{a:b,m:n}$. The notations
$(\cdot)^*$, $(\cdot)^T$, $(\cdot)^H$, and $(\cdot)^\dagger$ are used for conjugate, transpose,
Hermitian transpose, and Moore-Penrose pseudo-inverse of a
matrix, respectively. $\|\mathbf{x}\|$ is the Euclidean norm of the vector
x. $\Re(\cdot)$ and $\Im(\cdot)$ are the real and imaginary parts of a complex
number, respectively. $\lceil x \rfloor$ indicates the closest integer to $x$. If
$x$ is a complex number, then $\lceil x \rfloor = \lceil \Re(x) \rfloor + i \lceil \Im(x) \rfloor$.
$\mathbf{I}_m$ and $\mathbf{0}_m$ are $m \times m$ identity and null matrices, respectively.

II. Lattice-Reduction-Aided Detection

A. System Model 

We consider a MIMO system with m transmit and n receive
antennas in a rich-scattering flat-fading channel. Spatial
multiplexing is employed, so that data are transmitted as m
substreams of equal rate. These substreams are mapped onto
M-ary QAM symbols. Let x denote the complex-valued m × 1
transmitted signal vector, and y the complex-valued n × 1
received signal vector. The baseband model for this MIMO
system is

y = Hx + n, (1)

where H is the n × m channel matrix: its entries are uncorrelated,
zero-mean, unit-variance complex circularly symmetric
Gaussian fading gains $h_{ij}$, and n is the n × 1 additive
white complex Gaussian noise vector with zero mean and
$E[\mathbf{n}\mathbf{n}^H] = \sigma^2 \mathbf{I}$. The average power of each transmitted signal
$x_i$ is assumed to be normalized to 1, i.e., $E[\mathbf{x}\mathbf{x}^H] = \mathbf{I}$.
Additionally, we assume that the channel matrix entries are
fixed during each frame interval, and the receiver has perfect
knowledge of the realization of H.

B. Linear Detection 

In linear detection, the estimated signal $\hat{\mathbf{x}}$ is computed by
first premultiplying the received signal y by an m × n "weight
matrix" W. The two most common design criteria for W are
zero-forcing (ZF) and minimum mean-square error (MMSE).
In zero-forcing detection, the weight matrix $\mathbf{W}_{ZF}$ is set to be
the Moore-Penrose pseudo-inverse $\mathbf{H}^\dagger$ of the channel matrix
H, i.e.,

$\hat{\mathbf{x}}_{ZF} = \mathbf{W}_{ZF}\mathbf{y} = \mathbf{H}^\dagger\mathbf{y} = \mathbf{x} + \mathbf{H}^\dagger\mathbf{n}$. (2)

It is known that zero-forcing detection suffers from the noise
enhancement problem, as the channel matrix may be
ill-conditioned. Under the MMSE criterion, the weight matrix
W is chosen in such a way that the mean-squared error
between the transmitted signal x and the estimated signal $\hat{\mathbf{x}}$
is minimized. The mean-squared error (MSE) is defined as
$\mathrm{MSE} \triangleq E[\|\mathbf{x} - \hat{\mathbf{x}}\|^2] = E[(\mathbf{x} - \mathbf{W}\mathbf{y})^H(\mathbf{x} - \mathbf{W}\mathbf{y})]$. The
weight matrix W that minimizes the MSE is

$\mathbf{W}_{MMSE} = (\mathbf{H}^H\mathbf{H} + \sigma^2\mathbf{I})^{-1}\mathbf{H}^H$. (3)

It is well known that, as $\sigma \to 0$, the weight matrix $\mathbf{W}_{MMSE}$
approaches $\mathbf{W}_{ZF}$. Since $\mathbf{W}_{MMSE}$ takes noise power into
consideration, MMSE detection suffers less from noise
enhancement than ZF detection. In [8], [32], it is shown that
MMSE is equivalent to ZF in an extended system model, i.e.,

$\hat{\mathbf{x}}_{MMSE} = \mathbf{W}_{MMSE}\mathbf{y} = \bar{\mathbf{H}}^\dagger\bar{\mathbf{y}} = (\bar{\mathbf{H}}^H\bar{\mathbf{H}})^{-1}\bar{\mathbf{H}}^H\bar{\mathbf{y}}$, (4)

where

$\bar{\mathbf{H}} = \begin{bmatrix}\mathbf{H}\\ \sigma\mathbf{I}_m\end{bmatrix}$ and $\bar{\mathbf{y}} = \begin{bmatrix}\mathbf{y}\\ \mathbf{0}_{m\times 1}\end{bmatrix}$. (5)

Comparing (2) with (4), it follows that the two detection
methods can share the same structure in systolic-array
implementation, which we shall elaborate upon in Section V.
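
The equivalence between MMSE detection and ZF detection in the extended model can be checked in two lines. The derivation below is a worked verification added for illustration; it writes the extended channel matrix and received vector as $\bar{\mathbf{H}}$ and $\bar{\mathbf{y}}$, i.e., H stacked on top of $\sigma\mathbf{I}_m$ and y stacked on top of m zeros.

```latex
\begin{align*}
\bar{\mathbf{H}}^H\bar{\mathbf{H}}
  &= \begin{bmatrix}\mathbf{H}^H & \sigma\mathbf{I}_m\end{bmatrix}
     \begin{bmatrix}\mathbf{H}\\ \sigma\mathbf{I}_m\end{bmatrix}
   = \mathbf{H}^H\mathbf{H} + \sigma^2\mathbf{I}_m, \\
\bar{\mathbf{H}}^H\bar{\mathbf{y}}
  &= \begin{bmatrix}\mathbf{H}^H & \sigma\mathbf{I}_m\end{bmatrix}
     \begin{bmatrix}\mathbf{y}\\ \mathbf{0}_{m\times 1}\end{bmatrix}
   = \mathbf{H}^H\mathbf{y}, \\
(\bar{\mathbf{H}}^H\bar{\mathbf{H}})^{-1}\bar{\mathbf{H}}^H\bar{\mathbf{y}}
  &= (\mathbf{H}^H\mathbf{H} + \sigma^2\mathbf{I}_m)^{-1}\mathbf{H}^H\mathbf{y}
   = \mathbf{W}_{MMSE}\,\mathbf{y}.
\end{align*}
```

Because $\bar{\mathbf{H}}$ always has full column rank for $\sigma > 0$, the inverse exists even when H itself is ill-conditioned.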

C. Lattice-Reduction-Aided Linear Detection 

The idea underlying lattice reduction is the selection of
a basis for the lattice under some goodness criterion [33].
We first observe that, under the assumption of
QAM transmission, the transmitted vector x is an integer point
of a square lattice (after proper scaling and shifting of the
original QAM constellation). By interpreting the columns of
the channel matrix H as a set of lattice basis vectors, Hx is
also a lattice point. If two basis sets H and $\tilde{\mathbf{H}}$ are related by
$\tilde{\mathbf{H}} = \mathbf{H}\mathbf{T}$, with T a unimodular matrix, they generate the same
set of lattice points. In MIMO detection, the objective of the
lattice reduction algorithm is to derive a better-conditioned
channel matrix $\tilde{\mathbf{H}}$. In this paper, we focus on the
complex-valued LLL algorithm [10], [11]. More details about the LLL
algorithm will be provided in Section III.

Fig. 2. Block diagram of linear lattice-reduction-aided detection.

After lattice reduction of the channel matrix, we can perform
the linear detection described in Section II-B based
on $\tilde{\mathbf{H}}$. Consider ZF first. The estimated signal $\tilde{\mathbf{x}}$ can be written
as

$\tilde{\mathbf{x}} = \tilde{\mathbf{H}}^\dagger\mathbf{y} = \tilde{\mathbf{H}}^\dagger\left((\mathbf{H}\mathbf{T})(\mathbf{T}^{-1}\mathbf{x}) + \mathbf{n}\right) = \mathbf{T}^{-1}\mathbf{x} + \tilde{\mathbf{H}}^\dagger\mathbf{n}$. (6)

Since $\tilde{\mathbf{x}}$ is no longer an integer vector, the simplest but
suboptimal way of estimating $\mathbf{T}^{-1}\mathbf{x}$ is to round $\tilde{\mathbf{x}}$ element-wise to the
nearest integer. Let $\hat{\mathbf{x}}_q$ be the estimate of $\mathbf{T}^{-1}\mathbf{x}$ after rounding.
The final step is to transform $\hat{\mathbf{x}}_q$ back into an estimate of x,
which is done by multiplying $\hat{\mathbf{x}}_q$ by the unimodular matrix
T. Since the vector entries after the transformation could lie
outside the QAM constellation boundary, we finally quantize
those points outside the boundary to the closest constellation
point, i.e., $\hat{\mathbf{x}}_{LR} = Q(\mathbf{T}\hat{\mathbf{x}}_q)$. Fig. 2 shows the block diagram
of LR-aided ZF detection for MIMO. It is easy to see that
the same structure can also be used for MMSE detection, by
simply replacing H and y with the extended matrix $\bar{\mathbf{H}}$ and the
vector $\bar{\mathbf{y}}$ defined in (5), respectively. The remaining operations
are the same as in ZF.
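
The detection chain described above (reduce, equalize, round, map back with T, clip to the constellation) can be sketched on a toy example. The code below is only an illustration under simplifying assumptions: a real-valued 2 × 2 channel, a hand-picked unimodular matrix T rather than one produced by a lattice-reduction algorithm, and an integer alphabet {lo, ..., hi} standing in for the scaled and shifted QAM constellation. All function names are hypothetical.

```python
import math

def nearest_int(v):
    """Round to the nearest integer (ties go up), avoiding banker's rounding."""
    return math.floor(v + 0.5)

def inv2(M):
    """Inverse of a 2x2 real matrix (equals the pseudo-inverse when M is invertible)."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def lr_zf_detect(H, T, y, lo, hi):
    """LR-aided ZF: equalize with the reduced basis H*T, round, map back, clip."""
    H_red = matmul(H, T)                        # reduced channel matrix
    z_soft = matvec(inv2(H_red), y)             # estimate of T^{-1} x plus noise
    z_hat = [nearest_int(v) for v in z_soft]    # element-wise rounding
    x_hat = matvec(T, z_hat)                    # back to the original domain
    return [min(max(v, lo), hi) for v in x_hat] # clip to the constellation range

# Toy example: nearly collinear columns in H, made better conditioned by T.
H = [[1.0, 0.9], [0.0, 0.1]]
T = [[1, -1], [0, 1]]                           # integer entries, determinant 1
x = [1, -2]
noise = [0.04, -0.03]
y = [s + w for s, w in zip(matvec(H, x), noise)]
x_hat = lr_zf_detect(H, T, y, lo=-2, hi=1)
```

For MMSE detection the same function could be applied to the extended matrix and vector, provided the 2 × 2 inverse is replaced by a general pseudo-inverse.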



D. LR-Aided Successive Spatial-Interference Cancellation

Besides being suitable for linear detection, the systolic design
can also exploit the regularity of successive
spatial-interference cancellation (SIC). In [8], it is shown that
LR-aided SIC outperforms linear detection methods, while
exhibiting a complexity comparable to linear detection. The
LR-aided SIC can be conveniently described in terms of the
QR decomposition of the reduced channel matrix. Here we
briefly summarize the procedure of LR-aided ZF-SIC only, as
the LR-aided MMSE-SIC can be derived in a similar way.
Let the QR decomposition of the reduced channel matrix be
$\tilde{\mathbf{H}} = \tilde{\mathbf{Q}}\tilde{\mathbf{R}}$. First, multiplying y in (1) by $\tilde{\mathbf{Q}}^H$, we obtain

$\mathbf{v} = \tilde{\mathbf{Q}}^H\mathbf{y} = \tilde{\mathbf{R}}\mathbf{z} + \tilde{\mathbf{Q}}^H\mathbf{n}$, where $\mathbf{z} = \mathbf{T}^{-1}\mathbf{x}$. (7)

Then we can solve for z layer by layer, from the bottom
to the top, i.e.,

$\hat{z}_i = \lceil v_i / \tilde{r}_{i,i} \rfloor$, $\mathbf{v}_{1:i} := \mathbf{v}_{1:i} - \tilde{\mathbf{R}}_{1:i,i}\,\hat{z}_i$, (8)

where i runs from m down to 1 and $\hat{z}_i$ is the estimate of the $i$th entry
of z.
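
The layer-by-layer procedure can be sketched as a rounded back-substitution. The code below is a minimal illustration that assumes a real-valued upper-triangular R and integer symbols; for complex data the real and imaginary parts would be rounded separately. The function name `sic_back_substitution` and the numbers in the example are hypothetical.

```python
import math

def nearest_int(v):
    """Round to the nearest integer (ties go up)."""
    return math.floor(v + 0.5)

def sic_back_substitution(R, v):
    """Detect z from v = R z + noise, bottom layer first, cancelling as we go."""
    m = len(R)
    v = list(v)                                 # work on a copy
    z_hat = [0] * m
    for i in range(m - 1, -1, -1):              # layers m, m-1, ..., 1
        z_hat[i] = nearest_int(v[i] / R[i][i])  # decide the current layer
        for r in range(i + 1):                  # cancel its contribution
            v[r] -= R[r][i] * z_hat[i]
    return z_hat

R = [[2.0, 0.6, -0.4],
     [0.0, 1.5, 0.5],
     [0.0, 0.0, 1.0]]
z = [1, -1, 2]
noise = [0.1, -0.2, 0.15]
v = [sum(R[i][j] * z[j] for j in range(3)) + noise[i] for i in range(3)]
z_hat = sic_back_substitution(R, v)
```

A wrong decision in a lower layer is subtracted from all layers above it, which is the error-propagation effect that makes SIC sensitive to the quality of the reduced basis.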

III. Two Variants of LLL Algorithm 

In this section, we introduce two variants of the LLL algorithm
that are more time-efficient than the classical LLL algorithm
when using parallel processing. Since systolic arrays yield a
simple form of parallel processing, our systolic array design
for LRAD is based on these two algorithms.

We begin the discussion with the definition of an LLL-reduced
lattice. Let H (an n × m matrix) be a set of lattice basis vectors,
with QR decomposition H = QR. The basis set H is complex
LLL-reduced with parameter $\delta$ ($1/2 < \delta < 1$) if the following
two conditions are satisfied [10], [11]:

(a) $|\Re(r_{i,j})| \le \frac{1}{2}|r_{i,i}|$ and $|\Im(r_{i,j})| \le \frac{1}{2}|r_{i,i}|$, $1 \le i < j \le m$, (9)

(b) $\delta - \dfrac{|r_{i-1,i}|^2}{|r_{i-1,i-1}|^2} \le \dfrac{|r_{i,i}|^2}{|r_{i-1,i-1}|^2}$, $2 \le i \le m$. (10)

The second condition, (10), is called the Lovász condition,
and the process that makes the basis set satisfy (9) is called size
reduction. In the standard form of the LLL algorithm considered
in the literature [8]-[15], size reduction applies only to one
column of H during a single iteration. Systolic arrays,
which allow simple parallel processing, are capable of updating
the whole matrix without introducing extra delays. Hence, our
proposed systolic array is first designed based on the LLL
algorithm in a different form, which we call "LLL algorithm
with full size reduction (FSR-LLL)."

A. LLL algorithm with Full Size Reduction (FSR-LLL) 

Table I shows the LLL algorithm with full size reduction.
In the following discussion, we refer to the lines in Table I.
There are three main differences between FSR-LLL and
the conventional complex LLL algorithm¹, although the
lattice-reduced bases from both algorithms are still the same. First,
the full size reduction (lines 4-10) is executed in each iteration
of the while loop (line 3), which means that all columns of
R and T are size-reduced at the beginning of each iteration.
The advantage here is that, once condition (10) is also fulfilled
after full size reduction (i.e., no k' is found in line 11), then
FSR-LLL can immediately end the process (line 20). For
example, suppose that k equals 3 at the current iteration. Since all
columns in R and T are size-reduced after full size reduction,
if no k' can be found in line 11 (a search that a systolic array
can make in parallel), then no further processing is needed
in FSR-LLL. In the conventional LLL format, however, the
process will not end until columns 3 to m are sequentially
size-reduced. With a systolic-array implementation, FSR-LLL is
faster, and its efficiency is especially apparent when m is large.
The second difference is that the Givens rotation (lines 13-16)
is executed before the column swap (line 17). This is because
the Givens rotation process can work in parallel with full size
reduction, whereas the column swap cannot. This point will
be made clear in Section IV-A. Third, the QR decomposition
$\mathbf{Q}^H\mathbf{H} = \mathbf{R}$ is considered as the input of the algorithm, instead
of H = QR. From line 16, the Givens rotation matrix G
applies to the same two rows of $\mathbf{Q}^H$ and R, which simplifies
the design of the systolic array. Additionally, after FSR-LLL,
$\tilde{\mathbf{Q}}^H$ is ready for calculating the pseudoinverse of $\tilde{\mathbf{H}}$ for linear
detection.

¹For comparison, interested readers can refer to Table I in [11] for
the conventional complex LLL algorithm. Tables I and II in this paper are
presented in a format similar to the one in [11]. All the simulation results
related to the conventional LLL in this paper are also based on the same table.

TABLE I
LLL Algorithm with Full Size Reduction

INPUT: $\mathbf{Q}^H$, $\mathbf{R}$
OUTPUT: $\tilde{\mathbf{Q}}^H = \mathbf{Q}^H$, $\tilde{\mathbf{R}} = \mathbf{R}$, $\mathbf{T}$
(1)  Initialization $\mathbf{T} = \mathbf{I}$
(2)  $k = 2$
(3)  While $k \le m$
       Full Size Reduction
(4)    for $j = m, \cdots, 2$
(5)      for $i = j-1, \cdots, 1$
(6)        $\mu = \lceil r_{i,j}/r_{i,i} \rfloor$
(7)        $\mathbf{R}_{1:i,j} := \mathbf{R}_{1:i,j} - \mu\,\mathbf{R}_{1:i,i}$
(8)        $\mathbf{T}_{1:m,j} := \mathbf{T}_{1:m,j} - \mu\,\mathbf{T}_{1:m,i}$
(9)      end
(10)   end
(11)   Find the smallest $k'$ between $k \sim m$
       such that $\delta - |r_{k'-1,k'}/r_{k'-1,k'-1}|^2 > |r_{k',k'}|^2/|r_{k'-1,k'-1}|^2$
(12)   If $k'$ exists
         Givens Rotation
(13)     $\alpha = r_{k'-1,k'} / \|\mathbf{R}_{k'-1:k',k'}\|$
(14)     $\beta = r_{k',k'} / \|\mathbf{R}_{k'-1:k',k'}\|$
(15)     $\mathbf{G} = \begin{bmatrix} \alpha^* & \beta \\ -\beta & \alpha \end{bmatrix}$
(16)     $\mathbf{R}_{k'-1:k',k'-1:m} := \mathbf{G}\,\mathbf{R}_{k'-1:k',k'-1:m}$, $\mathbf{Q}^H_{k'-1:k',1:n} := \mathbf{G}\,\mathbf{Q}^H_{k'-1:k',1:n}$
         Column Swap
(17)     Swap columns $k'-1$ and $k'$ in R and T
(18)     $k := \max\{k'-1, 2\}$
(19)   else
(20)     $k := m+1$
(21)   end
(22) end
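
The FSR-LLL flow described above (full size reduction, search for the first violating column, Givens rotation, then column swap) can be sketched in plain Python. This is a software illustration under stated assumptions, not the systolic implementation: matrices are lists of rows with 0-based indices, the Lovász test is written in product form, and the rotation uses the conjugate of the second coefficient so that it stays unitary even if rounding leaves a tiny imaginary part on a diagonal entry (for a real diagonal entry this coincides with a rotation built from a real second coefficient). All function names are hypothetical.

```python
import math

def cround(z):
    """Round real and imaginary parts separately to the nearest integer."""
    return complex(math.floor(z.real + 0.5), math.floor(z.imag + 0.5))

def full_size_reduction(R, T):
    """Size-reduce every column of R against all columns to its left."""
    m = len(R)
    for j in range(m - 1, 0, -1):               # columns m, ..., 2
        for i in range(j - 1, -1, -1):          # rows j-1, ..., 1
            mu = cround(R[i][j] / R[i][i])
            for r in range(i + 1):
                R[r][j] -= mu * R[r][i]
            for r in range(m):
                T[r][j] -= mu * T[r][i]

def lovasz_violated(R, k, delta):
    """True if columns k-1 and k (0-based) violate the Lovasz condition."""
    return (delta * abs(R[k - 1][k - 1]) ** 2
            > abs(R[k][k]) ** 2 + abs(R[k - 1][k]) ** 2)

def rotate_and_swap(QH, R, T, k):
    """Givens rotation on rows k-1, k of R and Q^H, then swap columns k-1, k."""
    u, v = R[k - 1][k], R[k][k]
    norm = math.sqrt(abs(u) ** 2 + abs(v) ** 2)
    a, b = u / norm, v / norm
    for M in (R, QH):
        for c in range(len(M[0])):
            p, q = M[k - 1][c], M[k][c]
            M[k - 1][c] = a.conjugate() * p + b.conjugate() * q
            M[k][c] = -b * p + a * q
    R[k][k] = 0j                                # annihilated entry, exactly zero
    for M in (R, T):
        for row in M:
            row[k - 1], row[k] = row[k], row[k - 1]

def fsr_lll(QH, R, delta=0.75):
    """Return (Q^H, R, T) with Q^H (H T) = R reduced; inputs are not modified."""
    QH = [[complex(x) for x in row] for row in QH]
    R = [[complex(x) for x in row] for row in R]
    m = len(R)
    T = [[complex(i == j) for j in range(m)] for i in range(m)]
    k = 1                                       # 0-based index of column 2
    while k < m:
        full_size_reduction(R, T)
        kp = next((c for c in range(k, m) if lovasz_violated(R, c, delta)), None)
        if kp is None:                          # reduced: stop immediately
            break
        rotate_and_swap(QH, R, T, kp)
        k = max(kp - 1, 1)
    return QH, R, T
```

A call such as `fsr_lll(QH, R, delta=0.75)` expects the output of a QR decomposition of the channel matrix; passing an identity matrix for `QH` together with an upper-triangular `R` is enough for experimentation.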



B. All-Swap Lattice Reduction (ASLR) Algorithm 

The ASLR algorithm is a variant of the LLL algorithm, and
was first proposed for real-number lattices only [30]. Table II
describes its extension to a complex version. One significant
difference between FSR-LLL and ASLR is that every pair
of columns k and k - 1 with even (or odd) index k can
be swapped simultaneously. The algorithm begins with full
size reduction, which is the same as in FSR-LLL. Givens-rotation
and column-swap operations (same as in Table I, lines 13-17)
are executed on all even (odd) k that violate
condition (10), and another iteration then starts with the
indicator variable "order" set to odd (even). If condition (10)
holds for all even (odd) k, Givens rotation and column swap
are not executed. Instead, we can immediately check
all odd (even) k. Matrix R is already full-size
reduced, so there is no need to start the next iteration with full size
reduction (Table II, line 7 or 14). If neither an even nor an odd
k violates condition (10) after full size reduction, the ASLR
process ends.

TABLE II
All Swap Lattice Reduction Algorithm

INPUT: $\mathbf{Q}^H$, $\mathbf{R}$
OUTPUT: $\tilde{\mathbf{Q}}^H = \mathbf{Q}^H$, $\tilde{\mathbf{R}} = \mathbf{R}$, $\mathbf{T}$
(1)  Initialization $\mathbf{T} = \mathbf{I}$
(2)  order = EVEN
(3)  While (any swap is possible in lines (9) or (16))
       Full Size Reduction
(4)    Execute lines 4 ~ 10 in Table I
       Givens Rotation and Column Swap
(5)    if order = EVEN
(6)      if $\delta - |r_{k-1,k}/r_{k-1,k-1}|^2 \le |r_{k,k}|^2/|r_{k-1,k-1}|^2$ for all even k
(7)        go to line (13)
(8)      else
(9)        Execute lines 13 ~ 17 in Table I
           for all even k between 2 ~ m
           such that $\delta - |r_{k-1,k}/r_{k-1,k-1}|^2 > |r_{k,k}|^2/|r_{k-1,k-1}|^2$
(10)       order = ODD
(11)     end
(12)   else
(13)     if $\delta - |r_{k-1,k}/r_{k-1,k-1}|^2 \le |r_{k,k}|^2/|r_{k-1,k-1}|^2$ for all odd k
(14)       go to line (6)
(15)     else
(16)       Execute lines 13 ~ 17 in Table I
           for all odd k between 2 ~ m
           such that $\delta - |r_{k-1,k}/r_{k-1,k-1}|^2 > |r_{k,k}|^2/|r_{k-1,k-1}|^2$
(17)       order = EVEN
(18)     end
(19)   end
(20) end
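
The even/odd alternation of ASLR can be sketched in software as follows. This is a sequential emulation under stated assumptions, not the parallel systolic implementation: the simultaneous swaps of one phase are applied one after another, which gives the same result because they act on disjoint row and column pairs; indices are 0-based; and the helper functions and the `idle` counter (number of consecutive phases without a swap) are illustrative choices. All names are hypothetical.

```python
import math

def cround(z):
    """Round real and imaginary parts separately to the nearest integer."""
    return complex(math.floor(z.real + 0.5), math.floor(z.imag + 0.5))

def full_size_reduction(R, T):
    """Size-reduce every column of R against all columns to its left."""
    m = len(R)
    for j in range(m - 1, 0, -1):
        for i in range(j - 1, -1, -1):
            mu = cround(R[i][j] / R[i][i])
            for r in range(i + 1):
                R[r][j] -= mu * R[r][i]
            for r in range(m):
                T[r][j] -= mu * T[r][i]

def lovasz_violated(R, k, delta):
    """True if columns k-1 and k (0-based) violate the Lovasz condition."""
    return (delta * abs(R[k - 1][k - 1]) ** 2
            > abs(R[k][k]) ** 2 + abs(R[k - 1][k]) ** 2)

def rotate_and_swap(QH, R, T, k):
    """Givens rotation on rows k-1, k of R and Q^H, then swap columns k-1, k."""
    u, v = R[k - 1][k], R[k][k]
    norm = math.sqrt(abs(u) ** 2 + abs(v) ** 2)
    a, b = u / norm, v / norm
    for M in (R, QH):
        for c in range(len(M[0])):
            p, q = M[k - 1][c], M[k][c]
            M[k - 1][c] = a.conjugate() * p + b.conjugate() * q
            M[k][c] = -b * p + a * q
    R[k][k] = 0j
    for M in (R, T):
        for row in M:
            row[k - 1], row[k] = row[k], row[k - 1]

def aslr(QH, R, delta=0.75):
    """All-swap lattice reduction; returns (Q^H, R, T) with Q^H (H T) = R."""
    QH = [[complex(x) for x in row] for row in QH]
    R = [[complex(x) for x in row] for row in R]
    m = len(R)
    T = [[complex(i == j) for j in range(m)] for i in range(m)]
    phase = 0        # 0: even 1-based k (2, 4, ...), 1: odd 1-based k (3, 5, ...)
    idle = 0         # consecutive phases in which no swap was needed
    while idle < 2:  # stop once neither even nor odd k violates the condition
        if idle == 0:                       # R changed (or first pass): re-reduce
            full_size_reduction(R, T)
        ks = [k for k in range(1 + phase, m, 2) if lovasz_violated(R, k, delta)]
        for k in ks:                        # disjoint pairs: order is irrelevant
            rotate_and_swap(QH, R, T, k)
        idle = 0 if ks else idle + 1
        phase ^= 1                          # alternate between even and odd k
    return QH, R, T
```

Skipping the size reduction when the previous phase made no swap mirrors the shortcut of jumping directly to the other parity check without repeating full size reduction.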

C. Replacing the Lovász Condition with the Siegel Condition

From the previous discussion, it is clear that all basis vectors
are size-reduced within one processing iteration of full size
reduction. Additionally, according to line 11 in Table I and
lines 6 and 13 in Table II, the lattices processed by FSR-LLL
and ASLR both satisfy the Lovász condition in (10).
Therefore, we can conclude that these two algorithms also generate
LLL-reduced lattices. Consequently, like the conventional LLL,
FSR-LLL-aided and ASLR-aided detection also achieve full
receive diversity in MIMO systems [11], [12].

The Lovász condition involves two diagonal elements and
one off-diagonal element of the matrix R. In order to simplify
the data communication between processing elements in the
systolic array, we relax the Lovász condition by replacing it
with

$\delta - \dfrac{1}{2} \le \dfrac{|r_{i,i}|^2}{|r_{i-1,i-1}|^2}$, $2 \le i \le m$, (11)

where $\delta$ lies in the range (1/2, 1), the same as for the Lovász
condition. Condition (11) is also called the Siegel condition [31],
and it is weaker than the Lovász condition because

$\delta - \dfrac{1}{2} \le \delta - \dfrac{|r_{i-1,i}|^2}{|r_{i-1,i-1}|^2} \le \dfrac{|r_{i,i}|^2}{|r_{i-1,i-1}|^2}$, $2 \le i \le m$. (12)

The first inequality follows from (9), since the two bounds in (9)
give $|r_{i-1,i}|^2 \le \frac{1}{2}|r_{i-1,i-1}|^2$. A similar approximation
to (11) can be found in [34]. The advantage of using this
new condition is that only two neighboring diagonal elements
of R are involved. We will discuss the
impact of designing the systolic array with this new condition
in Section IV. Another advantage comes from the fact that
the new condition check can be done after taking the square
root in (11). In hardware implementation, this implies that we
can save precision bits by storing $|r_{i,i}|/|r_{i-1,i-1}|$ rather than
$|r_{i,i}|^2/|r_{i-1,i-1}|^2$. Additionally, the condition check can be
done without a division, simply by comparing the value of
$|r_{i,i}|$ and $\sqrt{\delta - 1/2}\,|r_{i-1,i-1}|$, where $\sqrt{\delta - 1/2}$ is a constant
that can be pre-computed once $\delta$ is determined. In the balance of
this paper, when we refer to FSR-LLL and ASLR we mean
FSR-LLL and ASLR with the Siegel condition.
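
The difference between the two tests, and the division-free form of the relaxed one, can be illustrated with a few lines of code. The sketch below is only an illustration; the function names and the sample values are hypothetical, and `r_prev`, `r_off`, and `r_cur` stand for the entries at positions (i-1, i-1), (i-1, i), and (i, i) of R.

```python
import math

def lovasz_ok(r_prev, r_off, r_cur, delta):
    """Lovasz test: needs two diagonal entries and one off-diagonal entry."""
    return delta - abs(r_off) ** 2 / abs(r_prev) ** 2 <= abs(r_cur) ** 2 / abs(r_prev) ** 2

def siegel_ok(r_prev, r_cur, delta):
    """Siegel test: needs only two neighboring diagonal entries."""
    return delta - 0.5 <= abs(r_cur) ** 2 / abs(r_prev) ** 2

def siegel_ok_division_free(r_prev, r_cur, sqrt_const):
    """Same decision as siegel_ok, using one pre-computed constant and a compare."""
    return abs(r_cur) >= sqrt_const * abs(r_prev)

delta = 0.99
sqrt_const = math.sqrt(delta - 0.5)   # computed once, when delta is chosen

# A pair that passes the Siegel test but fails the Lovasz test
# (small off-diagonal entry, moderately smaller second diagonal entry).
passes_siegel = siegel_ok(1.0, 0.8, delta)
passes_lovasz = lovasz_ok(1.0, 0.1, 0.8, delta)
```

With these values the pair is accepted by the Siegel test and rejected by the Lovász test, which is exactly the sense in which the relaxed condition is weaker.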

Since the Siegel condition is weaker than the Lovász condition,
one might expect the performance of the lattice reduction
algorithm with condition (11) to be worse. Yet, by a
proof similar to that in [11], [12], we can show that the
LLL algorithm with the Siegel condition also achieves maximum
receive diversity in MIMO systems. In the proof that LLL-aided
detection achieves full diversity [11], [12], the key step, and
the only step involving the LLL-reduced conditions, is that the
orthogonality defect $\kappa$ ($\kappa \ge 1$) of the LLL-reduced basis set
$\tilde{\mathbf{H}}$ is upper bounded by



2S-1 



(13) 



det (H^H) 

where the $\tilde{\mathbf{h}}_i$'s are the columns of $\tilde{\mathbf{H}}$. In particular, (13) also
holds for the lattices reduced by the LLL algorithm with the Siegel
condition. This can be justified by the same proof as in [11,
Appendix B], whose details are omitted in this paper.
Hence, the LLL algorithm with the Lovász condition replaced
by the Siegel condition also achieves maximum diversity in
MIMO systems. However, achieving maximum receive diversity
does not automatically imply that the bit-error-rate (BER)
performance is as good as with the conventional LLL algorithm.
One can easily observe that if $\delta$ is very close to 1/2,
condition (11) is almost always true. Thus, the Givens rotation
and column swap steps in the reduction algorithm would
seldom be performed, which causes the BER performance to
be much worse than with the conventional LLL. On the contrary,
as $\delta$ approaches 1 one can expect the performance of FSR-LLL
and ASLR to be closer to that of the conventional LLL. In Fig. 3
we show the empirical cumulative probability functions of the
orthogonality defect $\kappa$ for 4 × 4 channel matrices under three
different reduction algorithms. The results of FSR-LLL and
ASLR overlap for all three values of $\delta$, which implies that the
effects of these two methods on lattice reduction are almost the
same. For $\delta = 0.99$, FSR-LLL and ASLR give a result close
to the LLL with $\delta = 0.75$, which is a very common setting
as documented in previous works [8], [9], [12]. For $\delta = 0.51$
and 0.75, the gap between LLL and FSR-LLL (ASLR) is much
larger than for $\delta = 0.99$. In Section IV-C we will show that
for $\delta$ equal to 0.99, the BER performance of LR-aided linear
detection using FSR-LLL and ASLR is not worse than the one
using the conventional LLL with the same $\delta$ value. Based on
these results, in our systolic array design we choose $\delta = 0.99$.

Fig. 3. The empirical cumulative probability functions of the orthogonality
defect $\kappa$ for the 4 × 4 channel matrices under three different reduction
algorithms.

TO BE APPEARED IN JOURNAL OF COMMUNICATIONS AND NETWORKS

IV. Systolic Array for Two Lattice-Reduction 
Algorithms 

From Fig. |2] the whole process of LRAD can be viewed 
as taking two steps: lattice reduction for the channel matrix, 
and detection. In this section, we exhibit our systolic array 
design for the LLL lattice reduction algorithm. The ensuing linear
detection or SIC on the systolic array will be discussed in Section
V. In the following discussion, we assume that the channel
matrix has been QR decomposed. It is known that QRD
can be implemented in a systolic array based on a series of
Givens rotations, since Givens rotations can be executed in
a parallel manner [20]-[22]. Since the conventional systolic
array for QRD usually contains square root operations, which 
are computationally intensive in hardware implementation, 
a square-root-free systolic QRD based on Squared Givens 
rotations (SGR) can be used (the interested readers can refer 
to m, (Ml). In H, it is also shown that the sorted QRD 
(SQRD) can reduce the number of column swaps in the LLL 
algorithm, and hence leads to less processing time. However, 
it also requires higher hardware complexity and latency to 
implement SQRD than the conventional QRD [36].

A. Systolic Array for FSR-LLL 

In the following, we assume a 4 x 4 MIMO system (i.e., 
m = 4, n = 4) and illustrate the proposed systolic algorithm 
in three parts: full size reduction, Givens rotation, and column 
swap. 

1) Full Size Reduction: The systolic array for the remaining
parts of LRAD is shown in Fig. 4(a). Four different kinds of
PEs are used, viz., diagonal cells, off-diagonal cells, vectoring
cells, and rotation cells. For the full size reduction part, only
diagonal and off-diagonal cells are needed: the operations
of these two types of PEs are shown in detail in Fig. 4(b).
The vectoring cell and rotation cell will be introduced with
the Givens rotation description. There is a slight difference
between the off-diagonal cells in the upper-triangle part and
those in the lower-triangle part. Fig. 4(b) shows only the
off-diagonal cell in the upper-triangle part. Those off-diagonal
cells in the lower-triangle part have $y_{in}$ and $c_{in}$ come from
the top, while $c_{out}$ leaves from the bottom. Except for this
minor difference in the data interface, the operations are
the same as in the off-diagonal cells in the upper-triangle part.
Additionally, in Fig. 4(b) the dotted lines represent the logic
control signals transmitted between cells, and the solid lines
represent the data transmitted. To initialize the process, each
element of the matrices R and $\mathbf{Q}^H$ (denoted as r and q,
respectively, in Fig. 4(b)) from the QR decomposition is stored
in the PE at the corresponding position. For example, $q_{i,i}$ and
$r_{i,i}$ are stored in the corresponding diagonal cell $D_{ii}$. The
off-diagonal elements $q_{i,j}$ and $r_{i,j}$ are stored in the off-diagonal
cell $O_{ij}$. Additionally, the elements of the unimodular matrix
T (denoted as t in Fig. 4(b)) are also stored in the array, with
T initially set to the identity matrix.

Fig. 5 shows the overall process of the full size reduction
in the systolic array. In this stage, two major processing modes
are defined in each diagonal and off-diagonal cell: the size
reduction mode and the data mode, as detailed in Fig. 4(b). In
the size reduction mode, the objective of each cell is to make
the size-reduction condition valid. In the data mode, the cell
only performs data propagation. A cell works in one mode or the
other depending on the occurrence of the logic control signal "#".
For simplicity, we assume that a cell executes all operations of
the data mode or of the size reduction mode in one normalized
cycle. At T = 0, the external controller sends the logic control
signal "#" to cell $D_{33}$ through cell $D_{44}$. At T = 1, cell
$D_{33}$ works in the data mode due to the control signal "#" and
spreads the "#" signal to its 3 neighboring cells. Meanwhile,
$D_{33}$ sends out the data $(r_{3,3}, t_{3,3})^{(*)}$ to cell
$O_{34}$. Note that the superscript (*) is a tag bit attached to
the data, which indicates that the data are sent out by a diagonal
cell. The occurrence of a tag bit (*) drives the off-diagonal cell
to compute μ and to use μ to update the data stored in that cell.
As a result, at T = 2, cell $O_{34}$ sends out the newly computed μ
to the two neighboring cells $O_{24}$ and $D_{44}$. At the next time
instant (T = 3), the μ signal generated by $O_{34}$ meets the data
coming from cell $O_{23}$ ($O_{43}$) inside cell $O_{24}$ ($D_{44}$),
which executes the size reduction update. At the same time instant,
the data $(r_{2,2}, t_{2,2})^{(*)}$ enter cell $O_{23}$. As cell
$O_{34}$ did at T = 2, cell $O_{23}$ computes μ, updates
$(r_{2,3}, t_{2,3})$, and sends out μ to the neighboring cells
$O_{13}$ and $D_{33}$. The most important fact here is that cell
$O_{23}$ also propagates the data $(r_{2,2}, t_{2,2})^{(*)}$ to cell
$O_{24}$, and thus starts the column operations between column 2 and
column 4 at T = 4. Similarly, the column operations between column 1
and column 4 begin at T = 6 as $(r_{1,1}, t_{1,1})^{(*)}$ enter cell
$O_{14}$. Essentially, full size reduction is a series of column
operations between column j and columns j - 1, j - 2, ..., 1, for
all 2 ≤ j ≤ m, and we

Footnote: The real hardware cycle count could be a multiple of the normalized cycle.

Fig. 4. (a) The systolic array for the linear LRAD of a 4 × 4 MIMO system.
(b) The operations of diagonal and off-diagonal cells in the systolic array
("*" is an indicator bit used to control the flow of the algorithm, as
explained in Section IV-A).



can conclude the following facts for an m × m MIMO system:
[Fact 1] In this systolic flow, the column operation between
column j and column i (i < j) begins at T = m + j - 2i as
$(r_{i,i}, t_{i,i})^{(*)}$ enters cell $O_{ij}$.

Proof: Data $(r_{i,i}, t_{i,i})^{(*)}$ leaves cell $D_{ii}$ at
T = m - i, and it takes j - i cycles for it to propagate from
cell $D_{ii}$ to cell $O_{ij}$. ■
[Fact 2] All column operations on column j end at
T = 2m + j - 3 in cell $O_{mj}$.

Proof: In this systolic flow, the last column operation on
column j is always between column j and column 1, which
starts at T = m + j - 2 in cell $O_{1j}$ according to Fact 1. It
takes m - 1 more cycles to propagate μ from cell $O_{1j}$ to cell
$O_{mj}$ and finish the column operation. ■



Fig. 5. Flow chart of the full size reduction operations in the systolic array.
(Legend: data mode; size reduction mode.)



[Fact 3] The full size reduction ends at T = 3m - 3, when
all updates on column m are done.

Proof: The full size reduction ends when column m finishes
all of its column operations. Therefore, it follows from
Fact 2 (with j = m) that the last step is at T = 3m - 3. ■
Referring back to the example mentioned in Section III-A,
we can get a more concrete view of the advantage of
FSR-LLL over the conventional LLL form when a systolic
array is used. If FSR-LLL is applied, the systolic array takes
a total of 3m - 3 cycles to complete all the processes. For
non-systolic LLL, however, it takes 2m + j - 3 cycles to process
column j, and the column operations cannot be done in parallel.
So the total time to perform size reduction in non-systolic LLL
would be $\sum_{j=3}^{m} (2m + j - 3) = 2.5m^2 - 6.5m + 3$ cycles
in that example. In this case, as m increases beyond 3, the
advantage of FSR-LLL over the conventional format becomes
significant.
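
As an illustrative check of Facts 1 to 3 and of the cycle comparison above, the closed-form expressions can be evaluated in a few lines of Python. This is only a sketch: the function names are hypothetical, and the sequential count assumes the sum runs over columns j = 3, ..., m, which is the range that reproduces the closed form 2.5m^2 - 6.5m + 3.

```python
def column_op_start(m: int, j: int, i: int) -> int:
    """Fact 1: cycle at which the operation between columns j and i (i < j) begins."""
    return m + j - 2 * i


def column_ops_end(m: int, j: int) -> int:
    """Fact 2: cycle at which all column operations on column j have finished."""
    return 2 * m + j - 3


def systolic_cycles(m: int) -> int:
    """Fact 3: the full size reduction ends when column m is done (3m - 3)."""
    return column_ops_end(m, m)


def sequential_cycles(m: int) -> int:
    """Assumed non-systolic count: columns 3..m one after another, 2m + j - 3 cycles each."""
    return sum(2 * m + j - 3 for j in range(3, m + 1))


if __name__ == "__main__":
    # Print m, systolic cycle count, sequential cycle count.
    for m in (3, 4, 8, 16):
        print(m, systolic_cycles(m), sequential_cycles(m))
```

For the 4 × 4 example of Fig. 5, `column_op_start(4, 4, 3)`, `column_op_start(4, 4, 2)` and `column_op_start(4, 4, 1)` give the cycles 2, 4 and 6 quoted in the walk-through, and the two totals coincide at m = 3 before the sequential count grows quadratically.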

2) Givens Rotation: As mentioned in Section III-C, we
use the Siegel condition in the lattice reduction algorithm, which
only relates the two r elements in neighboring diagonal
cells. Hence, this condition can be checked during a full
size reduction step. For example, in Fig. 5 at T = 1,

Fig. 6. The operations of vectoring cells and rotation cells in the systolic
array.



cell $D_{33}$ sends data $r_{3,3}$ to cell $D_{22}$ along with the "#"
signal. At the next time instant, cell $D_{22}$ checks this
condition based on $|r_{3,3}|^2 / |r_{2,2}|^2$, and also generates the
logic control signal "swap" (see Fig. 4(b)). If δ - 1/2 is greater
than $|r_{i,i}|^2 / |r_{i-1,i-1}|^2$, then "swap" is "true" and drives
the vectoring cell to work. The operations of the vectoring and
rotation cells are shown in Fig. 6. The vectoring cell zeros out
the input data β by the Givens rotation matrix G, which is
calculated based on Table I, lines 13 to 15. The rotation cell
simply rotates the input data by the angle θ given by the
neighboring vectoring cell. Hence, the vectoring and rotation
cells also work in a systolic way, with the rotation angle
θ propagating between cells. As shown in Fig. 4(a), there
are 3 rotation cells and 1 vectoring cell between every two
consecutive rows of the systolic array. These cells apply the
Givens rotation to the R and $Q^H$ data in those two rows. The
vectoring cell is located between cells $D_{ii}$ and $O_{i-1,i}$ because
the Givens rotation step is executed prior to the column-swap
step in FSR-LLL, and $r_{i,i}$ needs to be zeroed so that the
matrix R is still upper triangular after the column swap.
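
The division-free Siegel check and the vectoring/rotation split can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the rotation matrix uses one common textbook construction of a complex Givens rotation, which is not necessarily the exact form given in Table I.

```python
import math


def siegel_swap(r_prev: complex, r_cur: complex, delta: float = 0.99) -> bool:
    """True when the Siegel condition fails; uses a comparison, not a division."""
    return (delta - 0.5) * abs(r_prev) ** 2 > abs(r_cur) ** 2


def givens(a: complex, b: complex):
    """Vectoring step: return (c, s) such that G = [[conj(c), conj(s)], [-s, c]]
    maps the pair (a, b) to (rho, 0) with rho = sqrt(|a|^2 + |b|^2)."""
    rho = math.hypot(abs(a), abs(b))
    if rho == 0.0:
        return 1.0 + 0j, 0j  # nothing to rotate
    return a / rho, b / rho


def rotate(c: complex, s: complex, x: complex, y: complex):
    """Rotation step: apply G to a pair (x, y) taken from two adjacent rows."""
    return c.conjugate() * x + s.conjugate() * y, -s * x + c * y


if __name__ == "__main__":
    c, s = givens(1 + 1j, 2 - 1j)
    print(rotate(c, s, 1 + 1j, 2 - 1j))  # second component is numerically zero
```

In this sketch the vectoring cell would call `givens` on the pair it must annihilate, and each rotation cell would call `rotate` with the same (c, s), mirroring the way the rotation parameters propagate along the two rows.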

Note that Givens rotation only applies to rows k' and k' - 1
during one iteration of FSR-LLL if k' exists (lines 13-16
in Table I). However, every $D_{ii}$ (i = 1, ..., m - 1) could
generate the "swap" signal during the full size reduction step.
Therefore, we need direct access from the external controller
to each diagonal cell in order to control the data path between
the diagonal cell and the vectoring cell. Namely, only cell
$D_{k'k'}$ can pass the "swap" signal to the vectoring cell and
perform the Givens rotation on rows k' and k' - 1. In Fig. 4(a),
we use a "switch" symbol between each pair of a diagonal cell
and a vectoring cell to represent the control by the external
controller. Only one switch is turned on during one iteration.

Additionally, a Givens rotation on rows k' and k' - 1
can begin right after $r_{k'-1,k'}$ is updated during the full size
reduction step. For example, $r_{3,4}$ is updated at T = 2 as shown
in Fig. 5, and the Givens rotation on rows 3 and 4 could start as
early as T = 3 without any interference with the remaining
operations of the full size reduction. This way, the time necessary
to perform Givens rotations can be partially hidden by the
full size reduction, and this is the reason why we want the
Givens rotation to occur prior to the column swap in our design.
For hardware implementation, one could consider using only
one rotation cell between every two neighboring rows of the
systolic array to reduce the hardware complexity. This will not
lead to a significant increase in time if Givens rotation and
full size reduction are performed in parallel.



3) Column swap: The columns k' and k' - 1 of R (and T)
should be swapped after the Givens rotation is done. However,
it is possible for the column swap to be partially overlapped in
time with size reduction and Givens rotation. For example, the
column swap could begin after R has been rotated but prior to
$Q^H$ being updated, since there is no need to swap columns of
$Q^H$.

The FSR-LLL stops when there is no possible column swap,
i.e., a k' in Table I, line 11, does not exist. The system flow
(lines 3, 18 and 20 in Table I) is controlled by the external
processor. The lattice-reduced matrices R and $Q^H$ and the
unimodular matrix T stay in the PEs. The systolic array
along with these matrices will be used for linear detection,
as described in Section V below.

B. All-Swap Lattice Reduction (ASLR) Algorithm 

The ASLR algorithm can also be performed by the systolic
array shown in Fig. 4(a). The process of full size reduction
is the same as in Fig. 5. During full size reduction,
the Siegel condition is also checked in each diagonal cell
$D_{11}$ to $D_{m-1,m-1}$. If the current value of "order" is even (odd),
then the "switch" between each cell $D_{k-1,k-1}$ with even (odd)
index k and the vectoring cell is turned on by the external
controller. Consequently, for every even (odd) index k, Givens
rotation between rows k - 1 and k could be executed if
needed. As for the column swap step, more than one pair
of columns could be swapped during one iteration, but all
these pairs are swapped in parallel. Hence, the time spent
on column swaps is the same as on swapping a single pair
of columns. Based on this observation, we can expect the
systolic ASLR to work more efficiently than the systolic
FSR-LLL. Comparisons between these two algorithms in terms of
bit-error-rate performance and of efficiency in execution time
are deferred to the next subsection.
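
The even/odd switch schedule described above can be illustrated with a short Python sketch. The helper names and the list-based representation of the diagonal are hypothetical; the sketch only shows which column pairs (k - 1, k) an iteration would select, assuming the Siegel test with parameter δ is evaluated on the squared diagonal magnitudes.

```python
def enabled_switches(m, order):
    """Indices k in 2..m whose switch between D_{k-1,k-1} and the vectoring
    cell is turned on for the given value of "order" (even or odd)."""
    parity = order % 2
    return [k for k in range(2, m + 1) if k % 2 == parity]


def pairs_to_swap(diag_sq, order, delta=0.99):
    """Indices k for which columns k-1 and k are swapped in this iteration.

    diag_sq[k-1] holds |r_{k,k}|^2 (1-based k), so the Siegel test for index k
    compares diag_sq[k-2] (row k-1) with diag_sq[k-1] (row k).
    """
    m = len(diag_sq)
    return [k for k in enabled_switches(m, order)
            if (delta - 0.5) * diag_sq[k - 2] > diag_sq[k - 1]]


if __name__ == "__main__":
    print(enabled_switches(8, 0))  # even indices
    print(enabled_switches(8, 1))  # odd indices
```

All indices returned by `pairs_to_swap` belong to disjoint column pairs, which is why they can be rotated and swapped in parallel within one iteration.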

Note that in our description we limit the application of
this systolic array to an m × m MIMO system. For
m × m MMSE-LRAD, although the matrix $Q^H$ is m × 2m
(the extended channel model in (5)), we can treat the submatrix
$Q^H_{1:m,(m+1):2m}$ as another square matrix, and store each
element of $Q^H_{1:m,(m+1):2m}$ in the PE at the corresponding
position. Namely, $q_{i,j}$ and $q_{i,j+m}$ should be stored in the same
PE, which still keeps the systolic array square.

C. Comparison between FSR-LLL and ASLR algorithm 

First, we compare the two algorithms in terms of bit-error-rate
(BER) performance, and also compare them with the
conventional LLL algorithm. In our simulation, 4-QAM is
assumed for the transmitted symbols. The constant δ is set to
0.99 in all algorithms for fair comparison. Let $E_b$ be defined
as the equivalent energy per bit at the receiver, and thus
$E_b/N_0$ is $m/(\sigma^2 \log_2 M)$. Fig. 7(a) shows the BER results
of minimum mean-square-error LRAD (in 4 × 4 and 8 × 8
MIMO systems) based on FSR-LLL (denoted as MMSE-FSR),
the ASLR algorithm (denoted as MMSE-ASLR) and the LLL
algorithm (denoted as MMSE-LLL). The BER results for ML
detection and MMSE without lattice reduction are also shown
for comparison. With δ = 0.99, the FSR-LLL and ASLR work as

Fig. 7. BER performance of FSR-LLL and ASLR-based MMSE LRAD.
(a) Linear detection (4 × 4 and 8 × 8 MIMO systems). (b) SIC (a 4 × 4 MIMO
system).



well as the LLL algorithm, and even slightly better in the case of
m = 8. This clearly shows that using the insignificantly weaker
Siegel condition does not deteriorate the BER performance
of linear detection in a MIMO system as compared to the
conventional LLL. In Fig. 7(b), the BER performance of a
4 × 4 MIMO system using LR-aided MMSE SIC based on
different lattice reduction algorithms is shown. Unlike the
linear detection case, the LLL-aided SIC works better than the
other two algorithms. Since the detection of the first layer in
SIC dominates the overall performance, this implies that, due to
the Siegel condition, the FSR-LLL-reduced or the ASLR-reduced
channel provides a lower SNR for the first layer in SIC than
the one given by the conventional LLL. Additionally, FSR-LLL
and ASLR lead to almost the same results in all three
MIMO systems, which is consistent with the results in Fig. 3.
Hence, we can conclude that although FSR-LLL and ASLR
give different lattice-reduced matrices, the LRAD based on
these two algorithms has very similar BER performance.

Next, we compare the efficiency of the systolic array for
both algorithms. It is known that the number of iterations
of FSR-LLL and ASLR depends on the condition number of
the channel matrix. If H is well-conditioned, lattice reduction

Fig. 8. The average number of column swaps in FSR-LLL, ASLR and
LLL-aided MMSE detection in an m × m MIMO system with $E_b/N_0$ fixed
at 20 dB.

Fig. 9. The average number of floating point operations in FSR-LLL, ASLR
and LLL-aided MMSE detection in an m × m MIMO system with $E_b/N_0$
fixed at 20 dB.



takes fewer iterations, and thus fewer cycles in the systolic array.
Since both algorithms begin with full size reduction, the total
execution time is fully determined by the number of column
swaps in the overall process. Fewer column swaps imply
fewer iterations. Fig. 8 shows the average number of column
swaps in FSR-LLL and ASLR-aided MMSE detection (with
$E_b/N_0$ fixed at 20 dB) in m × m MIMO systems (m = 4 to 16).
Note that for ASLR we count all the even or odd column
swaps during one iteration as only one swap, since they are
executed in parallel. In a 4 × 4 MIMO system, the difference
between the two algorithms is almost negligible. However, as the
number of antennas grows, the advantage of ASLR becomes
significant. For m > 8, ASLR has less than 65% of the column
swaps of FSR-LLL. Based on BER performance
and time-efficiency comparisons, ASLR should be a better
algorithm to apply on our systolic array, especially with
a large number of antennas.

For comparison, the results of the conventional LLL with
δ = 0.99 and 0.75 are also shown in Fig. 8. As expected,

reduction algorithms using ZF-SIC in a 4 × 4 MIMO system



LLL with δ = 0.99 has a higher complexity than LLL with
δ = 0.75. Furthermore, the conventional LLL has a much
higher average number of column swaps than FSR-LLL and
ASLR in the higher-dimensional MIMO systems (m > 8).
However, it is not fair to conclude that the complexities of
FSR-LLL and ASLR are much lower than that of the conventional
LLL; in fact, full size reductions are performed in the former
two algorithms, and full size reduction needs more computation
than the conventional size reduction in LLL. In
Fig. 9, we compare the number of floating point operations
(flops) in LLL, FSR-LLL, and ASLR using the same settings
as in Fig. 8. The flops are counted in terms of the number of
real additions and real multiplications. One complex addition
is counted as two flops (two real additions) and one complex
multiplication is counted as six flops (four real multiplications
and two real additions). The complexity of the QR decomposition
is neglected, since it is done only once at the beginning
of each of the three algorithms. It is shown that LLL with δ = 0.99
has the highest complexity among the three. Under the same
δ (= 0.99) setting, FSR-LLL and ASLR have a much lower
computational complexity than LLL. On the other hand, the
complexity of LLL with δ = 0.75 is just slightly higher than that of
FSR-LLL and ASLR, even though the average number of
column swaps of LLL with δ = 0.75 is more than twice
that of ASLR for m > 10. This implies that
the process of full size reduction introduces some additional
complexity. However, thanks to the (insignificantly) weaker
Siegel condition, the complexities of ASLR and FSR-LLL for
m > 10 are less than 50% of the complexity of LLL with the
same δ setting.

To further explore the advantage of using a systolic array,
we implemented our proposed architecture for a 4 × 4 MIMO
system on an FPGA. We carried out our design using the Xilinx
System Generator 11.5 (XSG) block-set in the Simulink design
environment. Verilog Hardware Description Language
(HDL) code is then generated automatically by XSG and
synthesized by Xilinx XST. The place and route is done by
Xilinx ISE 11.5. The word-lengths of R, $Q^H$, T and μ are set
to (18,13), (14,13), (8,0) and (3,0), respectively. As mentioned

TABLE III
FPGA Implementation Results

                                                     Avg. cycles (time) per channel matrix
Algorithm   Device         Slices        Clock       SQRD               QRD
ASLR        Virtex 5 (a)   2322/20480    160 MHz     80 (500.0 ns)      146 (912.5 ns)
ASLR        Virtex 6 (b)   1812/20000    249 MHz     80 (321.3 ns)      146 (586.3 ns)
FSR-LLL     Virtex 5 (a)   2335/20480    155 MHz     84 (541.9 ns)      164 (1058.1 ns)
FSR-LLL     Virtex 6 (b)   1798/20000    247 MHz     84 (340.1 ns)      164 (664.0 ns)
CLLL [14]   Virtex 4       3617/67584    140 MHz     130 (928.6 ns)     -
CLLL [14]   Virtex 5       1712/17280    163 MHz     130 (797.5 ns)     -

(a) part number: XC5VFX130T   (b) part number: XC6VLX130T



in Section III-C, the division in the Siegel condition check can be
avoided by using a comparator. The divisions in the Givens
rotation are implemented by the Newton-Raphson iterative
algorithm ifJTll. As for μ, it can be easily shown by simulation
that |μ| is either 0, 1, or 2 over 99.7% of the time. Hence, we
can simply use a set of comparators to determine the value of
μ instead of using a division. Values of |μ| greater than 2,
which rarely occur, are saturated to 2. The BER performance
of the fixed-point systolic implementation for a 4 × 4 MIMO
system is shown in Fig. 10, where 16-QAM modulation and
ZF-SIC detection are applied. The implementation results are
shown in Table III. We consider both QRD and SQRD as
the pre-processing of the lattice reduction algorithms. From the
results, ASLR is superior to FSR-LLL in terms of the average
processing time, and this advantage is significant when QRD
is applied. The hardware complexity of ASLR and FSR-LLL
is about the same, since they only differ from each other
in the external controllers. It is also clear that SQRD reduces
the average processing time by over 45% compared to
the normal QRD, at the cost of a higher computational effort for
SQRD.
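
The comparator-based evaluation of μ can be sketched as follows. This is an assumption-laden illustration rather than the implemented circuit: it assumes that the rounding is applied separately to the real and imaginary parts of the ratio, that the denominator is a positive real number (such as a diagonal entry of R), and that the thresholds sit at 0.5 and 1.5 times the denominator. The function names are hypothetical.

```python
def quantize_component(num: float, den: float) -> int:
    """Round num/den to the nearest integer in {-2, ..., 2} using comparisons
    only (no division); magnitudes above 2 are saturated to 2. Requires den > 0."""
    mag = abs(num)
    if 2 * mag < den:          # |num/den| < 0.5
        q = 0
    elif 2 * mag < 3 * den:    # 0.5 <= |num/den| < 1.5
        q = 1
    else:                      # |num/den| >= 1.5, saturated at 2
        q = 2
    return q if num >= 0 else -q


def quantize_mu(num: complex, den: float) -> complex:
    """Apply the comparator-based rounding to the real and imaginary parts."""
    return complex(quantize_component(num.real, den),
                   quantize_component(num.imag, den))


if __name__ == "__main__":
    print(quantize_mu(complex(2.6, -0.7), 2.0))
```

Each comparison only needs a shift and an add on the denominator (den, 3·den/2), which is the reason a small set of comparators can replace the divider in this sketch.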

In Table III, the FPGA implementation result for the conventional
complex LLL (CLLL) [14] is also listed for comparison.
On Virtex 5 and with SQRD, systolic ASLR operates at
a slightly lower clock frequency than CLLL; however, our
design requires only 61.5% of its average clock cycles. As
a result, ASLR is on average faster than CLLL by a factor of
1.6. This verifies the high-throughput advantage of systolic
arrays. On the other hand, a systolic array implementation may
have higher hardware complexity since it requires several
processing elements to work in parallel. The results in Table III
show that our designs occupy 36-38% more FPGA slices
than CLLL. However, with the fast advance of
FPGA technology and semiconductor processing, one may
consider trading some area for a faster processing speed. As
shown in Table III, when using the latest Xilinx Virtex 6 FPGA
device, our systolic designs could run at up to 249 MHz while
requiring less than 10% of the total FPGA slices.
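
The figures quoted in this comparison follow directly from the cycle counts and clock frequencies of Table III. The short Python sketch below reproduces the arithmetic; the variable and function names are illustrative only.

```python
def time_ns(cycles, clock_mhz):
    """Processing time in nanoseconds for a cycle count at a clock given in MHz."""
    return cycles / clock_mhz * 1e3


# Virtex 5 entries of Table III (SQRD pre-processing).
ASLR_V5 = {"cycles": 80, "clock_mhz": 160}
CLLL_V5 = {"cycles": 130, "clock_mhz": 163}

cycle_ratio = ASLR_V5["cycles"] / CLLL_V5["cycles"]       # fraction of CLLL cycles
speedup = time_ns(**CLLL_V5) / time_ns(**ASLR_V5)         # ratio of processing times

# Relative saving in cycles when SQRD replaces QRD.
sqrd_saving_aslr = 1 - 80 / 146
sqrd_saving_fsr = 1 - 84 / 164

if __name__ == "__main__":
    print(time_ns(**ASLR_V5), time_ns(**CLLL_V5))
    print(cycle_ratio, speedup, sqrd_saving_aslr, sqrd_saving_fsr)
```

The cycle ratio evaluates to about 0.615 and the time ratio to about 1.6, matching the 61.5% and 1.6 quoted in the text; both SQRD savings exceed 45%.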

V. Systolic Array for Detection Methods 

A. Linear Detection in Systolic Array 

After lattice reduction, the matrices $Q^H$ and R, along with
the unimodular matrix T, are stored in the systolic array. As
shown in Fig. 2, the first step of a linear detection consists
of premultiplying the received signal vector y by $\tilde{H}^{\dagger}$, which
yields $x = \tilde{H}^{\dagger} y = R^{-1} Q^H y$. Second, the result of this
matrix-vector multiplication needs to be rounded element-wise. The
final step is to multiply the rounded results by the unimodular
matrix T and constrain all results within the constellation
boundary. If $x_q$ denotes the element-wise-rounded x, the final
decision of the LRAD is $\hat{x}_{LR} = Q(T \cdot x_q)$, as described in
Section II-C.

In the following discussion, we assume a 4 × 4 MIMO
system, and consider zero-forcing detection first. The first
and last steps of a linear detection can be implemented by the
same systolic array of Fig. 4 without using extra cells. As for
the rounding and the final constellation boundary check, they
should be done outside the systolic array (they are not shown
in Fig. 11). To execute $x = R^{-1} Q^H y$ in the systolic array, we
separate it into two matrix-vector multiplications, $v = Q^H y$
and then $x = R^{-1} v$. Since $Q^H$ stays in the systolic array
after the lattice reduction ends, the received signal vector y can
be fed to the systolic array from the top in a skewed manner
as shown in Fig. 11(a). The vector $Q^H y$ is pumped out from
the rightmost column of the array. Diagonal and off-diagonal
cells are needed at this stage, and the operations of the cells
are shown in Fig. 12(a). Every cell performs the multiply-and-add
operation. If MMSE is chosen, the input vector should be
changed to a 2m × 1 vector y according to the extended
model (5). Let $y = [y_1^T \; y_2^T]^T$ and $Q^H = [Q_1 \; Q_2]$, where
$y_1$, $y_2$ are m × 1 vectors and $Q_1$, $Q_2$ are m × m matrices.
As mentioned in Section IV-B, the elements of $Q_1$ and $Q_2$
are stored in the same PEs. To compute $v = Q^H y$ using
the systolic array, we first let $y_1$ enter the array from the
top and multiply it by $Q_1$, which is the same as shown in
Fig. 11(a). Then $y_2$ enters the array right after $y_1$, also in a
skewed manner, and is multiplied by $Q_2$. Hence, for MMSE
we need an extra operation at the output of the array, which
is $v = Q_1 y_1 + Q_2 y_2$. For the remaining operations in the
systolic array, there is no difference between ZF and MMSE
detection.
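
The extra MMSE operation at the array output amounts to summing two partial matrix-vector products. A minimal Python sketch is given below; it works on plain nested lists, ignores the skewed systolic timing, and uses hypothetical helper names.

```python
def matvec(A, x):
    """Plain matrix-vector product for a list-of-rows matrix A."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def mmse_front_end(Q1, Q2, y1, y2):
    """v = Q1 y1 + Q2 y2: the two partial products are formed one after the
    other by the same array and added at its output."""
    p1 = matvec(Q1, y1)
    p2 = matvec(Q2, y2)
    return [a + b for a, b in zip(p1, p2)]


if __name__ == "__main__":
    Q1 = [[1, 2], [3, 4]]
    Q2 = [[0, 1], [1, 0]]
    print(mmse_front_end(Q1, Q2, [1, 1], [2, 3]))
```

The result equals the product of the m × 2m matrix [Q1 Q2] with the stacked 2m × 1 vector, which is why storing the two m × m blocks in the same PEs keeps the array square.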

The second stage consists of computing $x = R^{-1} v$.
Instead of computing $R^{-1}$ directly, the following recursive
equation [38] is considered for the systolic design:

$$x_j = \frac{1}{r_{j,j}} \Bigl( v_j - \sum_{k=j+1}^{m} r_{j,k} x_k \Bigr), \quad j \text{ runs from } m \text{ down to } 1. \qquad (14)$$

According to (14), it is clear that $R^{-1} v$ can be computed
directly from the components of R without computing $R^{-1}$.
Additionally, it can be implemented by the upper triangular part
of the systolic array, where matrix R has already been stored.
As shown in Fig. 11(b), the vector $v = Q^H y$ enters the array
from the right, and $x = R^{-1} v$ is computed by the triangular
array with the cell operations shown in Fig. 12(b). The output
vector x is then rounded element-wise outside the systolic
array. The final step consists of multiplying the quantized
vector $x_q$ by the unimodular matrix T, which is also stored
in the array. Similar to the first step of a linear detection, it
is a matrix-vector multiplication between T and $x_q$. Hence,
the data flow in Fig. 11(c) is the same as in Fig. 11(a). The cell





Fig. 11. The linear detection operations in the systolic array. (a) $v = Q^H y$.
(b) $x = R^{-1} v$. (c) $\hat{x}_{LR} = Q(T \cdot x_q)$.

Fig. 12. The detailed operations of the diagonal cells and off-diagonal cells
in the systolic array at different stages. (a) $Q^H y$ and $T \cdot x_q$. (b) $R^{-1} v$.



operations for $T \cdot x_q$ are shown in Fig. 12(a), and the array
output, quantized to the closest constellation point, is the
final result $\hat{x}_{LR}$ of the linear LRAD.
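
The three stages of the linear LRAD described above can be summarized in a short Python sketch. It is a functional model only, with no systolic timing, and it rests on several assumptions: the recursion for $R^{-1} v$ is taken to be standard back substitution, the symbols are assumed to be already scaled and shifted so that rounding to Gaussian integers is meaningful, and `clip` is a hypothetical stand-in for the constellation-boundary operation Q(·).

```python
def matvec(A, x):
    """Plain matrix-vector product for a list-of-rows matrix A."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def back_substitute(R, v):
    """Solve R x = v for upper-triangular R, from the last row up to the first."""
    m = len(v)
    x = [0j] * m
    for j in range(m - 1, -1, -1):
        acc = v[j] - sum(R[j][k] * x[k] for k in range(j + 1, m))
        x[j] = acc / R[j][j]
    return x


def round_gauss(z):
    """Round real and imaginary parts separately (Python rounds ties to even)."""
    return complex(round(z.real), round(z.imag))


def lr_linear_detect(QH, R, T, y, clip):
    """v = Q^H y, x = R^{-1} v, x_q = round(x), output = clip(T x_q)."""
    v = matvec(QH, y)
    x = back_substitute(R, v)
    xq = [round_gauss(z) for z in x]
    return [clip(z) for z in matvec(T, xq)]
```

Only `matvec` and `back_substitute` correspond to work done inside the array; the rounding and the final clipping model the operations performed outside it.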

B. Spatial-Interference Cancellation in Systolic Array 

The successive spatial-interference cancellation (SIC) can
also be performed on this systolic array with some modifications
to the PEs. Observing the first step of LR-aided
SIC shown in (7), it should be apparent that $Q^H y$ can
be computed in the systolic array in the same fashion as in
Fig. 11(a) and Fig. 12(a). The second step (8) of LR-aided
SIC can be done in the systolic array as shown in Fig. 13. The
operations are almost the same as those in Fig. 12(b), except
that we have to do a rounding in the off-diagonal cells $O_{ij}$
at the super-diagonal positions (j = i + 1). The rounding
operations are for the decision of each $z_i$. Similar to the linear
LRAD, the final step of LR-aided SIC is to multiply z by the
unimodular matrix T and bound all the outputs within the QAM
constellation. It can be done in the same way as in Fig. 11(c)
and Fig. 12(a), with $x_q$ replaced by z.
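
The interference-cancellation step differs from plain back substitution only in that each component is rounded before it is fed back. A minimal Python sketch follows; it assumes the usual form of LR-aided SIC, in which layer i is decided after subtracting the already decided layers i + 1, ..., m, and the function names are hypothetical.

```python
def round_gauss(z):
    """Round real and imaginary parts separately (Python rounds ties to even)."""
    return complex(round(z.real), round(z.imag))


def lr_sic(R, v):
    """Back substitution on upper-triangular R with a rounding decision per layer.

    The decision z[i] is made as soon as layer i is isolated, and the rounded
    value (not the soft value) is used when cancelling interference in the
    rows above.
    """
    m = len(v)
    z = [0j] * m
    for i in range(m - 1, -1, -1):
        acc = v[i] - sum(R[i][j] * z[j] for j in range(i + 1, m))
        z[i] = round_gauss(acc / R[i][i])
    return z


if __name__ == "__main__":
    print(lr_sic([[2, 1], [0, 1]], [1.2, -0.8]))
```

In the array, this per-layer rounding is what the super-diagonal cells add to the operations of Fig. 12(b).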

Notice that lattice reduction and linear detection (or SIC) are
performed in the same systolic array, and it can be hardware-efficient
to share the adder/multiplier/divider designed for
lattice reduction processing. For instance, there is one addition,
one multiplication, and one division in each diagonal cell,
and one addition and one multiplication in each off-diagonal
cell for linear detection or SIC, be it ZF or MMSE. These
operations are also contained in each cell at the LLL lattice
reduction stage. For SIC, it may seem that we need extra rounding
operations in the off-diagonal cells at the super-diagonal
positions. However, those rounding operations are already needed
in the off-diagonal cells during the full size reduction processing as

Fig. 13. The data flow and the detailed operations of the cells in the systolic
array for the interference-cancellation step of LR-aided SIC.



well. Hence, no extra hardware (adders or multipliers) is
needed in each cell for linear detection. Only extra
control logic for the array is needed in order to have each PE
work correctly in the different modes.

VI. Conclusion 

In this paper, we have described a systolic array performing
LLL-based lattice-reduction-aided detection for MIMO
receivers. Lattice reduction and the ensuing linear detection
or successive spatial-interference cancellation can be executed
by the same array, with minimum global access to each
processing element. The proposed systolic array with an external
logic controller can work with two different lattice-reduction
algorithms. One is the LLL algorithm with full size reduction,
which is a different form of the conventional LLL algorithm
and more suitable for parallel processing. The second one
is an all-swap complex lattice-reduction algorithm, which
generalizes the one originally proposed in [30] for real lattices.
Compared to FSR-LLL, ASLR operates on a whole matrix,
rather than on single columns, during the column-swap
and Givens-rotation steps. To reduce the complexity of data
communications between processing elements in the systolic
array, we replace the Lovász condition in the LLL algorithm by
the Siegel condition. Even though the Siegel condition is weaker
than the Lovász condition, the BER performance of LR-aided linear
detection based on our two algorithm versions appears to be
as good as with the conventional LLL, and the computational
complexity is reduced by the relaxation as well. Based on BER
performance and time-efficiency comparisons, ASLR should
be preferred to FSR-LLL, especially for a MIMO system
with a large number of antennas. The FPGA implementation
results also show that our proposed systolic architecture for
lattice reduction algorithms runs about 1.6× faster than the
conventional LLL, at the cost of a moderate increase in hardware
complexity. Additionally, due to the high-throughput
property of systolic arrays, our design appears very promising
for high-data-rate systems, such as MIMO-OFDM systems.